Abstract

We present designs for VLSI circuits computing Cyclic Shifts, Discrete Fourier Transforms, and Integer Multiplication, all based on a machine architecture, the Cube-Connected-Cycles (CCC), introduced by the authors in [10]. All of our designs match, to within a constant factor, the known theoretical lower bounds [3], [4], [8] for area $\times$ (time)$^2$ products.
1.  INTRODUCTION


Very-Large-Scale  integration  (VLSI)  is  revolutionizing  the  methodologies 
of digital system design. The traditional criteria of component count, whether applied to processors or to simpler devices, are no longer adequate to establish a scale of comparison among various solutions to a given problem. Indeed, number-of-elements criteria are substantially based on the fact that processing elements and their interconnections are realized by different media. This difference disappears in VLSI, which "integrates" both processing elements and their interconnection in a two-dimensional geometry, the surface of the silicon chip. Thus, a meaningful figure-of-merit is represented by the area occupied by
the  total  system,  since  area  captures  the  complexity  of  both  computation  and 
data  communication.  As  a  result,  the  solution  to  a  given  computational  problem 
involves  the  conception  of  an  interconnection  architecture,  its  layout,  and  the 
design  of  an  algorithm  for  that  architecture.  It  is  clear  that  the  traditional 
hardware-software  antinomy  disappears  in  VLSI  technology. 

Pioneering and fundamental work in the area has been done by Mead-Conway [1] and by Thompson [2], both as regards the development of a VLSI model of computation and in the design of computations (architecture+algorithms) for specific problems. As is typical of the methodology of concrete computational complexity, for a given problem and selected complexity measures one seeks both lower-bounds to these measures which hold for any realization in the computation model and upper-bounds by exhibiting explicit realizations which comply with the model. In spite of the relative novelty, the great interest of the topic is attested to by the additional contributions of Thompson [3], Abelson and Andreae [4], Kung and Leiserson [5], Guibas et al. [6], Brent and Kung [7,8], Savage [9], and Preparata-Vuillemin [10].
The VLSI computation models of Mead-Conway [1] and Thompson [2] are not significantly different. We briefly recall the latter one for the benefit of the reader. A VLSI computing system (or network) is viewed as a communication graph, whose vertices and edges are called nodes and wires, respectively. Nodes store and process local information; wires transmit information between nodes. Nodes and wires are laid out on a grid of unit squares, where "unit" relates to the so-called "feature width", a basic parameter characterizing the resolution of current fabrication techniques. Wires have unit width and must be partitionable into no more than $\nu$ sets of non-intersecting segments, where $\nu$ is the number of conducting layers. In this work, we assume that $\nu = 2$, the almost universal two-layer standard. It is assumed that a bit of information takes unit time to propagate from node to node, independently of the wire length (this implies that longer wires have more powerful drivers, of area proportional to the wire length); node processing time is absorbed into wire propagation time, and the total time for a given computation is the number of time units to execute it.

The usual metric selected for complexity is an area-time product $AT^{2\alpha}$, where $A$ is the chip area, $T$ is the computation time, and $\alpha$ is a real parameter satisfying $0 \le \alpha \le 1$. This metric allows a flexible trade-off (based on $\alpha$) between the production cost (area) and the incremental cost (time) of computation.

For several interesting problems, lower-bounds to the area-time product have been obtained. A crucial notion in obtaining such lower-bounds is the minimal bisection width $\omega$ of a given communication graph $G(V,E)$, which is defined as the smallest integer such that $\omega = |\{(u,v) \in E : u \in V_1, v \in V_2\}|$, where $\{V_1, V_2\}$ is a partition of $V$ with $|V_1| \le |V_2| \le |V_1| + 1$. Thompson has shown [2] that for any $n$-node communication graph with minimal bisection width $\omega$, $A \ge \omega^2/4$ (in unit squares). Therefore, lower-bounds to $AT^{2\alpha}$ are obtained by bounding the computation time $T$ in terms of $\omega$.

In this paper we restrict ourselves to the following problems: cyclic shifts, integer multiplication, and radix-2 Discrete-Fourier-Transform. As regards cyclic shifts, it has been shown in [4] and [10] that $T \ge n/(2\omega)$ for any VLSI design which performs any cyclic shift of an array of $n$ one-bit terms; using a technique due also to Thompson [2, p. 72], this leads to the lower-bound $AT^{2\alpha} = \Omega(n^{1+\alpha})$. Since the ability to perform an arbitrary cyclic shift of an $n$-bit string is reducible to the multiplication of two $(n/2)$-bit integers, the lower-bound to $AT^{2\alpha}$ for cyclic shift becomes a lower-bound to integer multiplication [4]; however, an independent proof of the latter, in a slightly more general model, has been supplied by
Brent and Kung [8]. Finally, as regards radix-2 DFT, Thompson [2] has shown that $AT^{2\alpha} = \Omega(n^{1+\alpha} \log^{2\alpha} n)$ for any communication graph which computes the DFT of $n$ numbers each represented with $O(\log n)$ bits.

The purpose of this paper is to provide upper-bounds to the chosen metric of complexity for the problems mentioned above. The paper is organized as follows. Section 2 reviews the structure and the layout of a general computation network, called the cube-connected-cycles [10], which is remarkably suited to VLSI design. Sections 3 and 4 discuss optimal designs based on the cube-connected-cycles; specifically Section 3 considers a network for cyclic shifts, while Section 4 considers networks for integer multiplication and radix-2 Fast-Fourier-Transform.

2.  THE  CUBE-CONNECTED-CYCLES 

The  cube-connected-cycles  (CCC)  interconnection  has  been  proposed  in  [10] 
as  a  general-purpose  network  of  processing  modules,  suited  for  the  implementation 
of  various  combinatorial  algorithms.  The  specifications,  the  operating  modes, 
and  the  performance  of  the  CCC  are  now  briefly  reviewed. 

An $h \times 2^s$ CCC-interconnection consists of $2^s$ cycles, indexed from 0 to $2^s-1$; each cycle is the circular interconnection of $h$ modules ($h \ge s$), indexed from 0 to $h-1$. Thus each module is addressed by a pair $(\ell,p)$, where $\ell$ and $p$ are respectively the cycle and the module indices, and is denoted $M[\ell,p]$. Each module has three ports: F, B, and L (1), and the connection is completely specified by

$$F(\ell,p) \leftrightarrow B(\ell,(p+1) \bmod h)$$
$$B(\ell,p) \leftrightarrow F(\ell,(p-1) \bmod h)$$
$$L(\ell,p) \leftrightarrow L(\ell+\varepsilon 2^p,p) \quad \text{if } p \le s-1 \text{ (unconnected if } p \ge s)$$

where $\varepsilon = 1 - 2\,\mathrm{BIT}_p(\ell)$ (here $\mathrm{BIT}_p(\ell)$ is the coefficient of $2^p$ in the binary expansion of $\ell$). In the hypothesis that modules reduce to nodes (i.e., they can be placed at vertices of a uniform grid of squares) and that wires are laid out on the grid, a layout of a $6 \times 2^4$ CCC is shown in figure 1. Notice that if all nodes of every cycle are ideally collapsed into a single node, the resulting set of nodes is connected as a binary $s$-dimensional cube ($s$-cube). This justifies the CCC denotation. (In the layout of figure 1, vertical and horizontal wires realize the cycle and cube connections, respectively.)
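The connection rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `ccc_links` and `collapsed_cube_edges` are hypothetical, modules are represented as `(cycle, position)` pairs, and the three ports are not modeled separately (each link is simply an unordered pair of modules).

```python
def ccc_links(h, s):
    """Return the undirected links of an h x 2^s CCC as frozensets of two modules."""
    assert h >= s, "each cycle needs at least s modules"
    links = set()
    for cycle in range(2 ** s):
        for pos in range(h):
            if h > 1:
                # Cycle connection: F(cycle, pos) <-> B(cycle, (pos + 1) mod h)
                links.add(frozenset({(cycle, pos), (cycle, (pos + 1) % h)}))
            if pos <= s - 1:
                # Lateral connection: L(cycle, pos) <-> L(cycle + eps * 2^pos, pos)
                bit = (cycle >> pos) & 1
                eps = 1 - 2 * bit
                links.add(frozenset({(cycle, pos), (cycle + eps * 2 ** pos, pos)}))
    return links


def collapsed_cube_edges(h, s):
    """Collapse every cycle to one node and return the resulting inter-cycle edges."""
    edges = set()
    for link in ccc_links(h, s):
        (c1, _), (c2, _) = tuple(link)
        if c1 != c2:
            edges.add(frozenset({c1, c2}))
    return edges
```

For h = 6 and s = 4 the collapsed graph has 32 edges, each joining two cycle indices that differ in exactly one bit, which is the edge set of the 4-cube.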


(1)  F,  B,  and  L  are  respectively  mnemonic  for  "forward",  "backward",  and 
"lateral". 


Figure 1. A standard layout for an $h \times 2^s$ CCC ($h = 6$, $s = 4$)


The dimensions of this $s$-cube are numbered $1, 2, \ldots, s$, and the set of horizontal wires realizing dimension $i$ is collectively denoted as sheaf $i$.

As a paradigm of computation, we consider the following type of algorithms. Abstractly, there are $n = 2^r$ data items (operands), assigned addresses from 0 to $2^r-1$ (or equivalently, each operand is addressed by an $r$-dimensional binary vector and assumed to be placed at the corresponding vertex of the $r$-cube). The algorithm is a sequence of $r = \log_2 n$ steps, each executable in parallel, with the property that at each of these steps each operand interacts with another operand, which is adjacent to it on a specified $r$-cube dimension; specifically, either the $i$-th (ASCEND type algorithms) or the $(r-i)$-th dimension (DESCEND type algorithms) pertains to step $i$. (Typical instances of such algorithms are the Radix-2 Fast-Fourier-Transform and Bitonic merging of sorted sequences.) We see that these algorithms are supported by an $r$-cube interconnection.
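The ASCEND and DESCEND paradigm can be illustrated with a small sequential simulation. The sketch below is an assumption-laden illustration, not the paper's formulation: the name `cube_algorithm` and the `combine` callback are hypothetical, dimensions are numbered from 0 rather than 1, and the parallel step is simulated by a loop over pairs.

```python
def cube_algorithm(data, combine, descend=False):
    """Run an ASCEND (or DESCEND) type algorithm on n = 2^r operands.

    combine(low, high) receives the two operands adjacent on the current
    cube dimension and returns their new values as a pair.
    """
    n = len(data)
    r = n.bit_length() - 1
    assert n == 1 << r, "the number of operands must be a power of two"
    values = list(data)
    dims = range(r - 1, -1, -1) if descend else range(r)
    for i in dims:
        stride = 1 << i
        # All pairs on dimension i are independent, so on a cube this loop
        # would be a single parallel step.
        for k in range(n):
            if k & stride == 0:
                values[k], values[k + stride] = combine(values[k], values[k + stride])
    return values
```

As a usage example, `cube_algorithm([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: (a + b, a + b))` leaves the total 36 at every address after three steps.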

We now show how algorithms of the type just described can be implemented on a $2^p \times 2^{r-p}$ CCC. Processing occurs in two consecutive phases. Making reference for concreteness to the ASCEND mode of operation, the first phase (referred to conventionally as LOWSHEAVES) pertains to $r$-cube dimensions $1,2,\ldots,p$ (which are subsumed by the CCC cycle connection), while the second phase (denoted HIGHSHEAVES) pertains to $r$-cube dimensions $p+1, p+2, \ldots, r$ (which correspond, in order, to CCC sheaves $1,2,\ldots,r-p$).
The LOWSHEAVES phase emulates in general the $r$-cube behavior as follows. Since operand-interaction can occur in the cycles only between adjacent modules, it is necessary to successively realize the adjacencies corresponding to $p$-cube dimensions $1,2,\ldots,p$. The key permutation for this task is the perfect unshuffle [11], and it is shown in [10] that the required adjacencies are globally realizable in time proportional to $2^p$, thereby showing that the first phase runs in time $O(2^p)$.

In  the  second  phase  the  r-cube  behavior  is  emulated  as  follows.  The  parallel 
step  pertaining  to  r-cube  dimension  p+j  can  no  longer  be  executed  in  one  time 
unit  ;  however,  using  repeated  circular  shift  within  the  cycles,  each  operand  can 
be  successively  brought  to  reside  for  one  time  unit  in  that  module  in  its  cycle 
which is connected in sheaf $j$. Although this processing of all operands in a cycle on sheaf $j$ now requires $O(2^p)$ time units, this computation can be pipelined
(overlapped)  with  the  analogous  computations  corresponding  to  all  other  sheaves, 
according  to  the  scheme  illustrated  in  figure  2.  The  sequence  of  steps  during 
which  a  given  sheaf  is  active  is  called  the  active  phase  of  that  sheaf  (for 
example,  steps  3-6  for  sheaf  3  in  figure  2). 
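The pipelining of the sheaves can be sketched as a simple schedule. The sketch assumes that sheaf j becomes active at step j and stays active for one full cycle length of consecutive steps; this is an assumption, chosen because it agrees with the example of sheaf 3 being active during steps 3-6 when the cycle length is 4. The function names are hypothetical.

```python
def sheaf_schedule(num_sheaves, cycle_len):
    """Map each sheaf j (1-based) to the steps of its active phase.

    Assumed schedule: sheaf j starts at step j and is active for
    cycle_len consecutive steps.
    """
    return {j: range(j, j + cycle_len) for j in range(1, num_sheaves + 1)}


def total_steps(num_sheaves, cycle_len):
    """Number of steps needed for all sheaves under the assumed schedule."""
    return num_sheaves + cycle_len - 1
```

Under this assumption the active phases overlap, so the whole second phase takes (number of sheaves) + (cycle length) - 1 steps instead of their product.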

Thus, the second phase also runs in time $O(2^p)$, and, when $p$ is chosen equal to $\lceil \log_2(r-p) \rceil$, processing time on the CCC is $O(\log n)$. We see therefore that, by combining the principles of pipelining and parallelism, the CCC can emulate the cube with no significant loss of performance. In the sequel, we shall assume that the cycle length $2^p$ satisfies the limitations corresponding to

$$\lceil \log_2(r-p) \rceil \le p \le \lfloor r/2 \rfloor \qquad (1)$$

Figure 2. Illustration of the pipelining of parallel computations. An "X" denotes a step at which a given sheaf is active.
Finally, as concerns the area of the layout, by referring to figure 1 we readily see that a $2^p \times 2^{r-p}$ CCC can be laid out on a chip of height $(2^{r-p} + 2^p - r)$ and width $(2^{r-p+1} - 1)$ (in the chosen units).


3.  CYCLIC  SHIFT 

Let $T[0:2^r-1]$ be an array of $n = 2^r$ one-bit operands (for any other operand length, both the area and the time will be multiplied by a constant).

We describe cyclic shifts to the left by $t < n$ positions of the operands of this array; although dual implementations are possible, for concreteness, we describe a cyclic shift scheme which corresponds to an algorithm in the ASCEND class.

We now note the following property: Assuming that $T[0:2^{r-1}-1]$ and $T[2^{r-1}:2^r-1]$ have both been subjected to a left-cyclic-shift by $t \bmod 2^{r-1}$ positions, the desired final configuration is obtained by the following alternative exchanges:

if ($t \ge 2^{r-1}$) then foreach $k$: $0 \le k \le 2^{r-1} - t \bmod 2^{r-1} - 1$ pardo $T[k] \leftrightarrow T[2^{r-1}+k]$ odpar
else foreach $k$: $2^{r-1} - t \bmod 2^{r-1} \le k \le 2^{r-1} - 1$ pardo $T[k] \leftrightarrow T[2^{r-1}+k]$ odpar    (2)


The proof of this property is straightforward (and it is basically supplied by the illustrations in figure 3).

Figure 3. Illustration for the proof of Rule (2), showing the initial, intermediate, and final configurations for the cases $t \ge 2^{r-1}$ and $t < 2^{r-1}$.

Property (2) can be reformulated as follows: adjacent pairs on dimension $r$ (i.e., pairs of the form $(j, j+2^{r-1})$) are partitioned into two sets, of respective sizes $t \bmod 2^{r-1}$ and $(2^{r-1} - t \bmod 2^{r-1})$, and the members of the former or of the latter set are exchanged depending upon whether $t < 2^{r-1}$ or $t \ge 2^{r-1}$. Furthermore, since the two halves of the array $T[0:2^r-1]$ are treated in exactly the same way as regards dimensions $1,2,\ldots,r-1$, it follows by induction on decreasing $j$ that the exchanges pertaining to dimension $1 \le j \le r$ are completely described by:

if ($t \bmod 2^j \ge 2^{j-1}$) then foreach $k$: $k = 2s \cdot 2^{j-1} + v$, $0 \le v \le 2^{j-1} - t \bmod 2^{j-1} - 1$, $0 \le s \le 2^{r-j} - 1$ pardo $T[k] \leftrightarrow T[k+2^{j-1}]$ odpar
else foreach $k$: $k = 2s \cdot 2^{j-1} + v$, $2^{j-1} - t \bmod 2^{j-1} \le v \le 2^{j-1} - 1$, $0 \le s \le 2^{r-j} - 1$ pardo $T[k] \leftrightarrow T[k+2^{j-1}]$ odpar    (3)

Notice that in the above rule (3) a crucial role is played by the parameter $v$, which defines the range of the pairs to be exchanged.
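Rule (3) can be checked with a short sequential simulation. The sketch below applies the exchanges dimension by dimension in ASCEND order; the function name `cyclic_shift_ascend` is hypothetical, and the parallel exchanges of one dimension are simulated by nested loops over the blocks and over the parameter v.

```python
def cyclic_shift_ascend(data, t):
    """Left cyclic shift by t positions using the exchanges of rule (3)."""
    n = len(data)
    r = n.bit_length() - 1
    assert n == 1 << r, "the array length must be a power of two"
    T = list(data)
    for j in range(1, r + 1):              # dimensions 1, 2, ..., r
        half = 1 << (j - 1)                # 2^(j-1)
        low_shift = t % half               # t mod 2^(j-1)
        if t % (2 * half) >= half:
            v_range = range(0, half - low_shift)
        else:
            v_range = range(half - low_shift, half)
        for block in range(0, n, 2 * half):  # block = 2s * 2^(j-1)
            for v in v_range:
                k = block + v
                T[k], T[k + half] = T[k + half], T[k]
    return T
```

For example, shifting `list("abcd")` by t = 3 first exchanges within the pairs and then exchanges only the pair with v = 0 on dimension 2, giving `['d', 'a', 'b', 'c']`.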


We now propose to implement the described cyclic shift operation on a CCC-like network. We select a $2^p \times 2^{r-p}$ CCC-interconnection so that dimension $j$ of the previous abstract description (for $j > p$) corresponds to CCC sheaf $j-p$. We now observe the following facts:

(i) Referring to the standard layout of figure 1, in sheaf $\ell$, for $\ell$ in the range $[1, r-p]$, adjacent pairs on the same horizontal line are characterized by the same value of the parameter $v$, as defined above. This means that all pairs on a given horizontal line of the layout will behave identically during the execution of the shift algorithm, whence the behavior of sheaf $\ell$ is completely specified by the behavior of modules $M[i,\ell-1]$ in cycles $i = 0,1,\ldots,2^{\ell-1}-1$. Therefore, as regards sheaf $\ell$, we may restrict ourselves to the subarray $T[0:2^{p+\ell-1}-1]$; in particular according to rule (3), this array is partitioned into $T[0:2^{p+\ell-1} - t \bmod 2^{p+\ell-1} - 1]$ and $T[2^{p+\ell-1} - t \bmod 2^{p+\ell-1} : 2^{p+\ell-1} - 1]$, and either the first or the second of them is exchanged on sheaf $\ell$.

(ii) $t \bmod 2^{p+\ell-1} = q_\ell 2^p + t \bmod 2^p$, where $q_\ell = \lfloor (t \bmod 2^{p+\ell-1})/2^p \rfloor$. Thus modules $M[i,\ell-1]$ ($0 \le i \le 2^{\ell-1}-1$) are divided into three sets, $\{M[0,\ell-1], \ldots, M[2^{\ell-1}-q_\ell-2,\ell-1]\}$, $\{M[2^{\ell-1}-q_\ell-1,\ell-1]\}$, and $\{M[2^{\ell-1}-q_\ell,\ell-1], \ldots, M[2^{\ell-1}-1,\ell-1]\}$, such that modules in the first and third set have fixed behavior during their active phases (either exchange or no-exchange), while $M[2^{\ell-1}-q_\ell-1,\ell-1]$ changes its behavior after the first $2^p - t \bmod 2^p$ steps of its active phase.
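The arithmetic behind fact (ii) can be verified numerically. The sketch below computes, for a given shift t, cycle-length exponent p, and sheaf index, the quotient q, the boundary of the partition from fact (i), the index of the mixed-behavior cycle, and the step at which it switches. The function name and the returned tuple layout are hypothetical.

```python
def mixed_module(t, p, sheaf):
    """Locate the single mixed-behavior cycle for a sheaf (sheaf index is 1-based)."""
    cycle_len = 1 << p                    # 2^p
    span = 1 << (p + sheaf - 1)           # 2^(p + sheaf - 1)
    q = (t % span) // cycle_len           # q for this sheaf
    boundary = span - t % span            # first index of the second part of the partition
    cycle_index = (1 << (sheaf - 1)) - q - 1
    switch_step = cycle_len - t % cycle_len
    return q, boundary, cycle_index, switch_step
```

The boundary always equals `cycle_index * 2**p + switch_step`, which is why only one cycle per sheaf needs to change behavior. For t = 5, p = 2, and sheaf 2 the function returns `(1, 3, 0, 3)`.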
(iii) For any value of $\ell$ (i.e., for all sheaves) the quantity $(2^p - t \bmod 2^p)$ is independent of $\ell$. This means that for all modules with mixed behavior, the change of behavior (from exchange to no-exchange, or vice versa) occurs after the same number of steps during their active phases.

(iv) The LOWSHEAVES phase is void. Indeed the effect of a left cyclic shift by $t \bmod 2^p$ positions within each cycle is implicitly achieved by the timing of the exchanges in the HIGHSHEAVES phase. All that is needed initially for each cycle is a forward cyclic shift by one position so that $T[i2^p-1]$ resides in $M[i,0]$, for $i = 0,1,\ldots,2^{r-p}-1$.

In summary the shift operation can be controlled as follows. Each module of the CCC is assigned two bits, $b_1$ and $b_2$, which respectively control the module operation during the first $(2^p - t \bmod 2^p)$ and the last $t \bmod 2^p$ steps of the module active phase. Bit $b_i$ is set to 1 to denote "exchange" and to 0 otherwise. The timing of the possible change of behavior (between step $(2^p - t \bmod 2^p)$ and step $(2^p - t \bmod 2^p + 1)$) is controlled by the bit sequence $(0)^{2^p - t \bmod 2^p}(1)^{t \bmod 2^p}$, which circulates in each cycle along with the operands. Thus we conclude that three control bits per module are sufficient, i.e., the cyclic shift operation has a finite-state module control.

Since the layout of figure 1 is used without modification (and the nodes have constant area) we reach the following conclusions. For $p = \lfloor r/2 \rfloor$ we obtain a CCC whose computation time is $O(2^p) = O(\sqrt{n})$, i.e. a "slow" realization, which, however, is optimal for the $AT^{2\alpha}$ metric ($0 \le \alpha \le 1$): in fact, referring to the expression for the CCC height and width obtained at the end of Section 2, we have $A = O(2^r)$ and $T = O(2^{r/2})$, whence $AT^{2\alpha} = O(n^{1+\alpha})$. If, on the other hand, we seek minimum computation time $O(\log n)$, the corresponding "fast" CCC is obtained by choosing $p = \lceil \log_2(r-p) \rceil \approx \mathrm{llog}\, n$ (1). In this case we obtain $T = O(2^p) = O(\log n)$ and $A = O((n/\log n)^2)$, whence setting $\alpha = 1$ in $AT^{2\alpha}$, we obtain $AT^2 = O(n^2)$, i.e. the network is optimal (notice that this occurs only for $\alpha = 1$).
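The two extreme cases can be worked out explicitly. The derivation below is a sketch that uses only the asymptotic form of the layout dimensions from Section 2, namely height $O(2^{r-p} + 2^p)$ and width $O(2^{r-p})$, together with $T = O(2^p)$; the asymptotic simplification of the height and width is an assumption made here for brevity.

```latex
% Slow design: p = \lfloor r/2 \rfloor, so 2^p = \Theta(\sqrt{n})
A = O\big((2^{r-p} + 2^p)\, 2^{r-p}\big) = O(2^r) = O(n), \qquad
T = O(2^p) = O(\sqrt{n}),
\\
AT^{2\alpha} = O\big(n \cdot n^{\alpha}\big) = O(n^{1+\alpha})
\quad \text{for every } 0 \le \alpha \le 1.

% Fast design: 2^p = \Theta(\log n)
A = O\big((n/\log n)^2\big), \qquad T = O(\log n),
\\
AT^{2\alpha} = O\big(n^2 (\log n)^{2\alpha - 2}\big),
\quad \text{which is } O(n^{1+\alpha}) \text{ only when } \alpha = 1 .
```

For $\alpha < 1$ the fast design exceeds the $\Omega(n^{1+\alpha})$ bound by a factor of order $n^{1-\alpha}/(\log n)^{2-2\alpha}$, which is why optimality holds only at $\alpha = 1$.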

4.  INTEGER  MULTIPLICATION  AND  DISCRETE  FOURIER  TRANSFORM 

The design of hardware multipliers is not a new problem in computer science. The classical shift and add method multiplies two $n$-bit integers in time $T = O(n)$, within a circuit area $A = O(n)$. Furthermore, such a circuit is laid out in a rectangle of constant width $O(1)$, corresponding to a few wires, and height $O(n)$ proportional to the number of bits. Although this multiplier does not meet the $AT = \Omega(n^{3/2})$ and $AT^2 = \Omega(n^2)$ bounds of Brent-Kung [8], it proves to be useful in designing optimal VLSI for the DFT and binary multiplication.

(1) The notation "llog" is used in this paper instead of the more common "loglog".

4.1.  CIRCUITS  FOR  DFT 

In [10], we indicate that the radix-2 FFT algorithm can be implemented on the CCC in time $T$ proportional to the cycle size $h$. Each module of the machine performs one of seven tasks at a given time: it may transmit operands, in either direction, on one of its three communication lines, or it may be performing an internal operation. Internal operations, in this context, are linear combinations of the form $(U,V) \leftarrow (U+\alpha V, U-\alpha V)$ where $U$ and $V$ are two operands present in the module, and $\alpha$ is an appropriate power of $\omega = e^{2\pi i/n}$, a primitive $n$-th root of unity. In fact, the successive values of $\alpha$ vary with time, taking the form $\omega_0 \cdot \omega_1^T$, where $\omega_0$ and $\omega_1$ are appropriate powers of $\omega$; keeping the value of $\omega_1$ in a special register allows $\alpha$ to be updated with a single multiplication. Internal operations can thus be computed in each module by a multiplication $\alpha \leftarrow \alpha \cdot \omega_1$, another multiplication $V \leftarrow \alpha \cdot V$, and a final add-subtract step $(U,V) \leftarrow (U+V, U-V)$. Using a shift and add multiplier, and a few registers, such a butterfly module can be implemented on a chip of area $O(\log n)$ proportional to the number of bits used for representing each of the $n$ inputs. As for the multiplier, this butterfly can be laid out in a rectangle of width $O(1)$ and height $O(\log n)$. The $n$ butterflies are then placed on the CCC of figure 1 as indicated in figure 4. It should be pointed out that in figure 4 the horizontal wires realizing the sheaf connection can be interleaved with the horizontal wires belonging to the butterfly modules; this interleaving at most doubles the height of the latter modules. Thus, the width of this new layout for a $2^p \times 2^{r-p}$ CCC (with $n = 2^r$) is $O(2^{r-p}) = O(n/2^p)$; we shall now evaluate the height of the CCC.
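The internal operation of a butterfly module described above can be sketched as a three-stage update. This is an illustrative sketch only: it uses Python complex numbers instead of the fixed-precision shift and add arithmetic of the actual module, and the names `root_of_unity` and `butterfly_step` are hypothetical.

```python
import cmath


def root_of_unity(n, k=1):
    """Return the k-th power of exp(2*pi*i/n), a primitive n-th root of unity."""
    return cmath.exp(2j * cmath.pi * k / n)


def butterfly_step(u, v, alpha, w1):
    """One internal operation of a butterfly module.

    Stages: update the multiplier (alpha <- alpha * w1), scale the second
    operand (V <- alpha * V), then add-subtract ((U, V) <- (U + V, U - V)).
    Returns the new operands and the updated multiplier.
    """
    alpha = alpha * w1
    v = alpha * v
    return u + v, u - v, alpha
```

For instance, `butterfly_step(1, 1, 1, -1)` updates the multiplier to -1 and returns `(0, 2, -1)`. Keeping `w1` in a register means each step costs two multiplications and one add-subtract, as stated in the text.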

As  we  see  from  figure  1,  in  each  cycle  there  are  two  sets  of  modules: 
the  set  of  the  sheaf-modules,  whose  lateral  port  is  used,  and  the  possibly 
empty  set  of  non-sheaf  modules,  which  we  shall  consider  first.  Let  H  be  the 
module  height. 

Each row of $2^{r-p}$ non-sheaf modules (there are $2^p-(r-p)$ such rows) can be laid out in an obvious way, using height $H$; the chip height used to accommodate these rows is thus $(2^p-r+p)H$. As regards sheaf modules, although more compact placements are possible, we just assume the standard placement shown in figure 4, where sheaf $i$ uses height $2(H + 2^i)$ (2). Thus the $(r-p)$ sheaves contribute height $2^{r-p+2}+2H(r-p)$, and the total chip height is $(2^{r-p+2} + H(2^p+r-p)) = O(n/2^p + 2^p \log n)$, since $r = \log n$ and $H = O(\log n)$. It follows that (provided $2^p$, the cycle length, is upper-bounded by $\sqrt{n/\log n}$; we already know that $2^p \ge \log n$) the total CCC area is $A = O(n^2/2^{2p})$.


Figure  4. 


CCC. 


Processing time is devoted to butterfly operations and operand transmission, each of which requires time $O(\log n)$. There are $2^p + \log(n/2^p)$ steps in the computation, thus total time is $T = O(2^p \log n)$. Since we have just observed that $2^p$ is bounded as $\log n \le 2^p \le \sqrt{n/\log n}$, for any choice of $T$ within the bounds $\log^2 n \le T \le \sqrt{n \log n}$ we have just designed networks of area $A = O(n^2 \log^2 n/T^2)$, thus achieving the lower-bound of Thompson [2].
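The area-time trade-off for the DFT designs can be tabulated with a small cost model. The sketch below drops all constant factors and takes logarithms to base 2, both of which are simplifying assumptions; the function name `dft_ccc_cost` is hypothetical.

```python
def dft_ccc_cost(r, p):
    """Return (area, time) of the DFT design on a 2^p x 2^(r-p) CCC, constants dropped.

    Uses A = n^2 / 2^(2p) and T = 2^p * log n with n = 2^r and log n = r.
    """
    n = 1 << r
    log_n = r
    cycle_len = 1 << p
    # The analysis requires log n <= 2^p <= sqrt(n / log n).
    assert log_n <= cycle_len and cycle_len ** 2 * log_n <= n
    area = n ** 2 // cycle_len ** 2
    time = cycle_len * log_n
    return area, time
```

For n = 2^20 the admissible exponents are p = 5, 6, 7, and in each case `area * time**2` equals `n**2 * r**2`, so the product A T^2 stays at n^2 log^2 n while area is traded against time.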

Of independent interest is the product $AT$ which, as observed by Thompson [2], is proportional to the amount of energy spent in the computation. In this regard, a lower bound $AT = \Omega(n^{3/2} \log n)$ is obtained in [2] for the computation of the DFT; from the lower bound $AT = \Omega(n^{3/2})$ obtained by Brent-Kung [8] for binary multiplication, it is straightforward however to conclude that any DFT circuit satisfies the bound $AT = \Omega((n \log n)^{3/2})$. This last bound is met by a "slow" CCC design, with the choice of $2^p = O((n/\log n)^{1/2})$ for cycle size, yielding a circuit for the DFT with values $A = O(n \log n)$ and $T = O(\sqrt{n \log n})$.

The fastest circuit, among those described, is obtained for a cycle size $2^p = O(\log n)$; it uses an area $A = O(n^2/(\log n)^2)$ for computing DFT in time $O((\log n)^2)$.

(2) The factor 2 is due to the interleaving of module wires and sheaf connection.

4.2.  CIRCUITS  FOR  INTEGER  MULTIPLICATION 

It is well-known [13] that the Discrete Fourier Transform can be used to compute convolution products. From the preceding section, we know that we can construct circuits computing the convolution of two sequences of $q$ integers, each integer being represented on $\log_2 q$ bits, with the following characteristics:

$A \cdot T^2 = O(q^2 \log^2 q)$ for any $T$ such that $(\log q)^2 \le T \le \sqrt{q \log q}$.

In [13], Schönhage and Strassen show that, if $\omega = e^{2\pi i/q}$ and its powers are represented with $5 \log_2 q$ bits, and the arithmetic is carried out with this precision, the approximation error in computing convolutions via FFT remains confined to the fractional parts of the terms involved.

In order to compute the product of two $n$-bit integers, we divide each operand into $q = n/\log n$ blocks of length $\log n$ bits each, and we compute the convolution of the two sequences of $q$ integers. The exact product is then found by "releasing the carries", in a straightforward manner.

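The block decomposition and the final carry release can be illustrated as follows. This sketch deliberately computes the convolution directly rather than through the FFT, so it shows only the splitting and carry-release steps, not the circuit of this section; the function name `multiply_by_blocks` and the `block_bits` parameter are hypothetical.

```python
def multiply_by_blocks(x, y, block_bits):
    """Multiply non-negative integers by convolving their blocks, then releasing carries."""
    base = 1 << block_bits

    def split(value):
        # Least-significant block first.
        blocks = []
        while value:
            blocks.append(value & (base - 1))
            value >>= block_bits
        return blocks or [0]

    xs, ys = split(x), split(y)
    # Direct convolution of the two block sequences (an FFT could be used instead).
    conv = [0] * (len(xs) + len(ys) - 1)
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            conv[i + j] += a * b
    # Release the carries: each term keeps its low block and passes the rest upward.
    result, carry = 0, 0
    for k, term in enumerate(conv):
        total = term + carry
        result |= (total % base) << (k * block_bits)
        carry = total // base
    result += carry << (len(conv) * block_bits)
    return result
```

Each convolution term can be much larger than one block, which is why the carry propagated to the next position is an arbitrary integer rather than a single bit.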
Setting $q = n/\log n$ in the expression above for convolution shows that we have designed circuits for binary integer multiplication having the characteristics:

$A \cdot T^2 = O(n^2)$ for any $T$ such that $(\log n)^2 \le T \le \sqrt{n}$.

These circuits meet the $AT^2 = \Omega(n^2)$ bound of Brent and Kung [8]; in this class, the slow circuit corresponding to $T = O(\sqrt{n})$ also meets the $AT = \Omega(n^{3/2})$ lower bound [8].

Although the circuits in Section 4 appear to be too complex to be feasible on one chip in the present state of the technology, the sheer existence of, say, an $A = O(n)$, $T = O(\sqrt{n})$ multiplier raises interesting and very practical prospects.